[AMDGPU] siloadstoreopt generate REG_SEQUENCE with aligned operands #162088
base: main
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Janek van Oirschot (JanekvO)

Changes

Stop generating misaligned operands (misaligned compared to the dst alignment)

Full diff: https://github.com/llvm/llvm-project/pull/162088.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index f0d1117664983..1185fbf325932 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -1363,6 +1363,15 @@ SILoadStoreOptimizer::checkAndPrepareMerge(CombineInfo &CI,
return Where;
}
+static bool isCompatibleAlignSubReg(const TargetRegisterClass *RC, unsigned Idx,
+ const GCNSubtarget *STM,
+ const SIRegisterInfo *TRI) {
+ if (!STM->needsAlignedVGPRs() || !TRI->isVGPRClass(RC) ||
+ !TRI->isProperlyAlignedRC(*RC) || AMDGPU::getRegBitWidth(*RC) == 32)
+ return true;
+ return !(TRI->getChannelFromSubReg(Idx) & 0x1);
+}
+
// Copy the merged load result from DestReg to the original dest regs of CI and
// Paired.
void SILoadStoreOptimizer::copyToDestRegs(
@@ -1411,12 +1420,24 @@ SILoadStoreOptimizer::copyFromSrcRegs(CombineInfo &CI, CombineInfo &Paired,
const auto *Src0 = TII->getNamedOperand(*CI.I, OpName);
const auto *Src1 = TII->getNamedOperand(*Paired.I, OpName);
- BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
- .add(*Src0)
- .addImm(SubRegIdx0)
- .add(*Src1)
- .addImm(SubRegIdx1);
-
+ // Make sure the generated REG_SEQUENCE has sensibly aligned registers.
+ if (isCompatibleAlignSubReg(Paired.DataRC, SubRegIdx1, STM, TRI)) {
+ BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
+ .add(*Src0)
+ .addImm(SubRegIdx0)
+ .add(*Src1)
+ .addImm(SubRegIdx1);
+ } else {
+ auto BMI =
+ BuildMI(*MBB, InsertBefore, DL, TII->get(AMDGPU::REG_SEQUENCE), SrcReg)
+ .add(*Src0)
+ .addImm(SubRegIdx0);
+ for (unsigned i = 0; i < Paired.Width; ++i) {
+ unsigned PreviousChannel = TRI->getChannelFromSubReg(SubRegIdx0);
+ BMI.addReg(Src1->getReg(), /*flags=*/0, TRI->getSubRegFromChannel(i))
+ .addImm(TRI->getSubRegFromChannel(PreviousChannel + i + 1));
+ }
+ }
return SrcReg;
}
diff --git a/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir b/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir
index a67cf22bdd1ce..b45b69010141e 100644
--- a/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir
+++ b/llvm/test/CodeGen/AMDGPU/merge-flat-with-global-load-store.mir
@@ -213,7 +213,7 @@ body: |
; GCN: [[DEF:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
; GCN-NEXT: [[DEF2:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
- ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]], %subreg.sub1_sub2
+ ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_96_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]].sub0, %subreg.sub1, [[DEF2]].sub1, %subreg.sub2
; GCN-NEXT: FLAT_STORE_DWORDX3 [[DEF]], killed [[REG_SEQUENCE]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s96) into `ptr poison`, align 4)
%0:vreg_64_align2 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
@@ -230,7 +230,7 @@ body: |
; GCN: [[DEF:%[0-9]+]]:vreg_64_align2 = IMPLICIT_DEF
; GCN-NEXT: [[DEF1:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
; GCN-NEXT: [[DEF2:%[0-9]+]]:vreg_96_align2 = IMPLICIT_DEF
- ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]], %subreg.sub1_sub2_sub3
+ ; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_128_align2 = REG_SEQUENCE [[DEF1]], %subreg.sub0, [[DEF2]].sub0, %subreg.sub1, [[DEF2]].sub1, %subreg.sub2, [[DEF2]].sub2, %subreg.sub3
; GCN-NEXT: FLAT_STORE_DWORDX4 [[DEF]], killed [[REG_SEQUENCE]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s128) into `ptr poison`, align 4)
%0:vreg_64_align2 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
; GCN-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY [[DEF1]].sub1
; GCN-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[DEF1]].sub0
; GCN-NEXT: [[REG_SEQUENCE:%[0-9]+]]:vreg_64_align2 = REG_SEQUENCE [[COPY]], %subreg.sub1, [[COPY1]], %subreg.sub0
This is a regression, you do not need these new copies
if (CompatRC0 != CI.DataRC)
  BuildMI(*MBB, InsertBefore, DL, TII->get(TargetOpcode::COPY), Src0Reg)
      .add(*Src0);
You shouldn't need new copies, and this is checking against the wrong register class? DataRC should already be a subregister class of SuperRC. Also should prefer to try constrainRegClass before falling back to copy, but I don't think that will happen in any real case
> this is checking against the wrong register class?

Yeah, needs a better compatibility check.

> DataRC should already be a subregister class of SuperRC

I assumed so, but it might not be the case if the operand register has a subreg idx.

> try constrainRegClass before falling back to copy

From trying it out, it seems to always use the aligned variant (i.e., constrainRegClass(%1:vreg_64_align2, VReg_64) will always continue using vreg_64_align2 instead of forcing VReg_64), so it may require a COPY. Might just try to construct the REG_SEQUENCE in vgpr_32 parts only, which is probably easier for further optimizations afterwards as well. (Do let me know if I'm missing low-hanging fruit on this end, though.)
> I assumed so but might not be the case if the operand register has subreg idx

Mostly no, but you should probably still call getMatchingSuperRegClass to find the common case. I still think this check is just wrong; none of the copies in the tests should be necessary.

Also need to add a new case where you do need to copy.
Actually for this I think you should always be able to constrain, no copies required
> Actually for this I think you should always be able to constrain, no copies required

Am I to do something direct like setRegClass for this? Because trying to constrain an align2 class to its non-align2 equivalent (e.g., constrainRegClass(%1:vreg_64_align2, VReg_64)) will always fall back on the align2 variant, which is exactly what I don't want for the REG_SEQUENCE operands generated here.
You shouldn't be trying to go in that direction? You cannot simply relax register classes without accounting for all the uses of the register. But transforms should generally work in a constraint-increasing direction to enable folds.
Can you add the new test that produced the unaligned use? I'm not really sure how you ended up with this problem in the first place
It'll be similar to some of the tests I'm trying to add in #160547 (similarly, I won't be able to reproduce it end-to-end, as I reverted #154115 for now while I try fixing these vgpr alignment issues).
For the purposes of this PR, you don't need to. You should still add the pre-merge MIR here that induced the unaligned use after